A Phonetic-Based Approach to Chinese Chat Text Normalization
Abstract
Chatting is a popular communication medium on the Internet, carried out via ICQ, chat rooms, etc. Chat language differs from standard natural language in its anomalous and dynamic nature, which renders conventional NLP tools inapplicable. The dynamic nature is especially troublesome because it quickly makes a static chat language corpus outdated as a representation of contemporary chat language. To address this problem, we propose phonetic mapping models that represent mappings between chat terms and standard words via phonetic transcription, i.e. Chinese Pinyin in our case. Unlike character mappings, phonetic mappings can be constructed from an available standard Chinese corpus. To normalize dynamic chat language terms, we extend the source channel model by incorporating the phonetic mapping models. Experimental results show that this method is effective and stable in normalizing dynamic chat language terms.
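The idea of scoring candidates through Pinyin can be illustrated with a minimal sketch. It is not the authors' model: it assumes a simple noisy-channel score, log P(chat term | word) + log P(word), where the channel term is approximated by a normalized edit-distance similarity between Pinyin strings. The tables `CHAT_PINYIN` and `STANDARD`, the probabilities in them, and the function names are all hypothetical toy values chosen for illustration.

```python
import math

# Hypothetical toy data: toneless Pinyin for two chat terms.
CHAT_PINYIN = {"稀饭": "xifan", "88": "baba"}

# Hypothetical candidates: standard word -> (Pinyin, toy unigram probability).
STANDARD = {
    "喜欢": ("xihuan", 0.020),
    "拜拜": ("baibai", 0.010),
    "这里": ("zheli", 0.015),
}


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def normalize(chat_term):
    """Return the candidate word with the highest toy noisy-channel score."""
    t_py = CHAT_PINYIN[chat_term]
    best_word, best_score = None, -math.inf
    for word, (w_py, prior) in STANDARD.items():
        # Phonetic similarity in [0, 1], used here as a stand-in for P(term | word).
        sim = 1 - edit_distance(t_py, w_py) / max(len(t_py), len(w_py))
        # Floor the similarity so the logarithm is always defined.
        score = math.log(max(sim, 1e-6)) + math.log(prior)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


print(normalize("稀饭"), normalize("88"))
```

In this toy setting the phonetic table depends only on standard words and their Pinyin, which mirrors the abstract's point that such mappings can be built without a chat corpus. A real system would also need context from a language model over the whole sentence, which this sketch omits.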
Similar resources
Improving Text Normalization using Character-Blocks Based Models and System Combination
There are many abbreviations and non-standard tokens in SMS and Twitter messages. Normalizing these non-standard tokens makes natural language processing easier in these domains. Recently, character-level machine translation (MT) and sequence labeling methods have been used for this normalization task and have demonstrated competitive performance. In this paper, we propose an approach to segme...
A Unified Framework for Text Analysis in Chinese TTS
This paper presents a robust text analysis system for Chinese text-to-speech synthesis. In this study, a lexicon word or a continuum of non-hanzi characters of the same category (e.g. a digit string) is defined as a morpheme, which is the basic unit forming a Chinese word. Based on this definition, the three key issues concerning the interpretation of real Chinese text, namely lexical disambi...
Phonetic normalization using z-score in segmental prosody estimation for corpus-based TTS system
Recently, corpus-based text-to-speech (CB-TTS) has been actively studied throughout the world. Statistical training methods are generally applied to prosodic rules in CB-TTS, and the classification and regression tree (CART) is one of the most widely used methods. In this paper, we present an efficient CART training approach based on z-score phonetic normalization. Our idea comes from the fact that...
A Graph-based Approach for Contextual Text Normalization
The informal nature of social media text renders it very difficult to be automatically processed by natural language processing tools. Text normalization, which corresponds to restoring the non-standard words to their canonical forms, provides a solution to this challenge. We introduce an unsupervised text normalization approach that utilizes not only lexical, but also contextual and grammatica...
A Rule-Based Text-to-Speech System for Portuguese
This paper describes the latest progress in the development of a text-to-speech system for Portuguese. The system comprises 4 major modules: text normalization, linguistic and phonetic processing, generation of the synthesizer parameters and synthesis. The present rule-based version, based on the Klatt80 formant synthesizer, has achieved promising results, namely in what concerns the performanc...